condition often leads to an estimator that attains the semiparametric efficiency bound.

1. Introduction

Two  step  estimators  are  useful  for  a  variety  of  models,   including  sample  selection 
models  and  models  that  depend  on  expectations  of  economic  agents.     Estimators  where  the 
first  step  is  nonparametric  are  particularly  important,   having  many  applications  in 
econometrics,   and  providing  a  natural  approach  to  estimation  of  parameters  of  interest. 
The  purpose  of  this  paper  is  to  derive  the  form  of  an  asymptotically  efficient  two  step 
estimator,   given  a  first  step  estimator. 

The  efficient  estimator  for  a  given  first  step  nonparametric  estimator  will  often  be 
fully  efficient,   attaining  the  semiparametric  efficiency  bound  for  the  model,   as  we  show 
for  some  sample  selection  models  considered  by  Newey  and  Powell  (1993).     Full  efficiency 
occurs  because  the  first  step  is  just   identified,   analogous  to  the  efficiency  of  a 
limited  information  estimator  of  a  simultaneous  equation  when  all  the  other  equations  are 
just  identified.      An  analogous  result  for  two-step  parametric  estimators  is  given  in 
Crepon,   Kramarz,   and  Trognon  (1997),  where  optimal  estimation  in  the  second  step  leads  to 
full  efficiency  if  the  first  step  is  exactly  identified. 

We  will  first  give  some  general  results  that  characterize  second  step  estimators  that 
are  efficient  in  a  certain  class  and  consider  construction  of  estimators  that  are 
approximately  efficient.      We  then  derive  the  form  of  efficient  estimators  in  several 
specific  models,   including  conditional  moment  restrictions  that  depend  on  functions  of 
conditional  expectations  and  sample  selection  models  where  the  propensity  score  (i.e.   the 
selection  probability)   is  nonparametric.      We  also  describe  how  an  approximately  efficient 
estimator  could  be  constructed  by  optimally  combining  many  second  step  estimating 
equations. 

Throughout  the  paper  we  rely  on  the  results  of  Newey  (1994)  to  derive  the  form  of 
asymptotic  variances  and  make  efficiency  comparisons.     Those  results  allow  us  to  sidestep 
regularity  conditions  for  asymptotic  normality  and  focus  on  the  issue  at  hand,   which  is 
the form of an efficient estimator. In this approach we follow long-standing econometric
practice where efficiency comparisons are made without necessarily specifying a full set
of  regularity  conditions.      Of  course,   we  could  give  general  regularity  conditions  for 
specific  estimators  (e.g.   as   in  Newey  (1994)  for  series  estimators  or  Newey  and 
McFadden  (1994)  for  kernel  estimators),   but  this  would  detract  from  our  main  purpose. 

As  an  initial  example,   consider  the  following  simple  model  in  which  the  conditional 
mean  of  a  dependent  variable     y     given  some  conditioning  variable     x     is  proportional  to 
its  conditional  standard  deviation: 


$$ y = \beta_0\sigma(x) + u, \qquad E[u \mid x] = 0, \qquad \sigma^2(x) = \mathrm{Var}(y \mid x). \qquad (1) $$


Given a sample $\{(y_i, x_i')',\ i = 1, \ldots, n\}$ of observations on $y$ and $x$, one type of
estimator for $\beta_0$ would be an instrumental variables (IV) estimator replacing $\sigma(x)$
with a nonparametric estimator $\hat\sigma(x)$ and using an instrument $a(x)$ to solve the
equation

$$ \sum_{i=1}^n a(x_i)[y_i - \beta\hat\sigma(x_i)] = 0 \qquad (2) $$

for $\hat\beta = [\sum_{i=1}^n a(x_i)\hat\sigma(x_i)]^{-1}\sum_{i=1}^n a(x_i)y_i$. For example, the least squares
estimator would have $a(x) = \hat\sigma(x)$. If the data generating process and the nonparametric
estimator $\hat\sigma(x)$ are sufficiently regular, so that $\hat\beta$ is root-n consistent and asymptotically
normal, then the formulae given in Newey (1994) can be used to derive the following form of the
asymptotic distribution for $\hat\beta$:

$$ \sqrt{n}(\hat\beta - \beta_0) \xrightarrow{d} N\bigl(0,\ \{E[a(x)\sigma(x)]\}^{-2}E[\zeta^2a(x)^2]\bigr), \qquad (3) $$

$$ \zeta = u - [2\sigma(x)]^{-1}\{u^2 - \sigma(x)^2\}. $$

The asymptotic variance of this estimator is that of an IV estimator with instrument
$a(x)$, regressor $\sigma(x)$ and residual $\zeta$. The efficient choice of instrument, as in
Chamberlain (1987), is $\sigma(x)/\omega(x)$ for $\omega(x) = E[\zeta^2 \mid x]$. The novel feature of this optimal
instrument is that it depends on the inverse of the conditional variance of $\zeta$ rather
than the original residual $u$. If $v = u/\sigma(x)$ is independent of $x$ then
$\zeta = \sigma(x)[v - (v^2 - 1)/2]$, so that $E[\zeta^2 \mid x]$ is proportional to $\sigma^2(x)$, and the best
instrument is $1/\sigma(x)$, the same as if the first stage estimation were not accounted for. In
general though, it is necessary to account for the first-stage estimator in forming the optimal
instrument for the second stage.
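
The proportionality claim in the independence case can be checked directly. The following is a
short verification, assuming only that $v$ is independent of $x$ with a finite fourth moment; the
constant $c$ is notation introduced here for the check and is not part of the original argument.

```latex
\begin{align*}
\zeta &= \sigma(x)\bigl[v - \tfrac{1}{2}(v^2 - 1)\bigr], \\
E[\zeta^2 \mid x] &= \sigma(x)^2\, E\bigl[\{v - \tfrac{1}{2}(v^2 - 1)\}^2\bigr] = c\,\sigma(x)^2, \\
\frac{\sigma(x)}{E[\zeta^2 \mid x]} &= \frac{1}{c\,\sigma(x)} \;\propto\; \frac{1}{\sigma(x)}.
\end{align*}
```

Because the IV estimator is unchanged when the instrument is multiplied by a nonzero constant,
the factor $1/c$ plays no role in the resulting estimator.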

The best IV estimator is weighted least squares with weight $\omega(x)^{-1}$. As in Newey
(1993), estimation of the optimal instrument should not affect the asymptotic variance,
so for an estimator $\hat\omega(x)$ that is suitably well behaved, the weighted least squares estimator

$$ \tilde\beta = \Bigl[\sum_{i=1}^n \hat\omega(x_i)^{-1}\hat\sigma(x_i)^2\Bigr]^{-1}\sum_{i=1}^n \hat\omega(x_i)^{-1}\hat\sigma(x_i)y_i \qquad (4) $$

should be efficient. Alternatively, as we discuss below, an approximately efficient
estimator could be constructed by GMM estimation with moment conditions $A(x)[y - \beta\hat\sigma(x)]$,
where $A(x)$ is some vector of approximating functions.
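
As an illustration of how the IV estimator solving equation (2) and the weighted least squares
estimator in equation (4) are computed, the following is a minimal sketch. The function names
`iv_estimate` and `wls_estimate`, the design $\sigma(x) = 1 + x^2$, and the use of the true
$\sigma(x)$ and $\omega(x)$ in place of nonparametric estimates are all assumptions made for this
example only.

```python
import random


def iv_estimate(a, sigma_hat, y):
    """IV estimator solving sum_i a_i * (y_i - beta * sigma_hat_i) = 0, as in equation (2)."""
    num = sum(ai * yi for ai, yi in zip(a, y))
    den = sum(ai * si for ai, si in zip(a, sigma_hat))
    return num / den


def wls_estimate(omega_hat, sigma_hat, y):
    """Weighted least squares of y on sigma_hat with weights 1/omega_hat, as in equation (4)."""
    a = [si / wi for si, wi in zip(sigma_hat, omega_hat)]  # instrument sigma_hat / omega_hat
    return iv_estimate(a, sigma_hat, y)


# Hypothetical design (not from the paper): sigma(x) = 1 + x^2, beta_0 = 2,
# and u = sigma(x) * v with v standard normal and independent of x.
rng = random.Random(0)
beta0 = 2.0
x = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
sigma = [1.0 + xi * xi for xi in x]   # true sigma(x), standing in for a first-step estimate
y = [beta0 * s + s * rng.gauss(0.0, 1.0) for s in sigma]

beta_ls = iv_estimate(sigma, sigma, y)      # least squares: a(x) = sigma(x)
omega = [s * s for s in sigma]              # proportional to E[zeta^2 | x] in this design
beta_wls = wls_estimate(omega, sigma, y)    # instrument proportional to 1 / sigma(x)
print(beta_ls, beta_wls)
```

In an actual application `sigma` and `omega` would be replaced by first-step estimates, and the
resulting sampling variability is what the adjusted residual $\zeta$ accounts for.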


2.  General   Methods 

To describe the general class of estimators we consider, let $m(z,\theta,\alpha)$ denote a
vector of functions, where $z$ denotes a single data observation, $\theta$ is a parameter
vector, and $\alpha$ is an unknown function. Suppose that the moment restrictions

$$ E[m(z,\theta_0,\alpha_0)] = 0 \qquad (5) $$

are satisfied, where the "0" subscript denotes true values. This moment restriction
and an estimator $\hat\alpha$ of $\alpha_0$ can be used to construct an estimator $\hat\theta$ of $\theta_0$, by
solving the equation

$$ \sum_{i=1}^n m(z_i,\theta,\hat\alpha)/n = 0. \qquad (6) $$

The class of estimators we consider is of this form, where $m$ is restricted to be in
some set of feasible moment functions, and $\hat\alpha$ may depend on $m$. In the example of
Section 1, $z = (y, x')'$, $\theta = \beta$, $\alpha = \sigma$ denotes the conditional standard deviation
function, and $m(z,\theta,\alpha) = \sigma(x)[y - \beta\sigma(x)]$.

To characterize the optimal $\bar m(z,\theta,\alpha)$ in some class, i.e. the one where $\hat\theta$ has the
smallest asymptotic variance, we need the asymptotic distribution of $\hat\theta$. In general, the
asymptotic variance will depend on the form of $\hat\alpha$, but for the moment, we will simply
assume that for each $m(z,\theta,\alpha)$ there is an associated function $u_m(z)$ such that

$$ \sum_{i=1}^n m(z_i,\theta_0,\hat\alpha)/\sqrt{n} = \sum_{i=1}^n u_m(z_i)/\sqrt{n} + o_p(1). \qquad (7) $$

Here $u_m(z)$ is the influence function of $\sum_{i=1}^n m(z_i,\theta_0,\hat\alpha)/n$, with the term
$u_m(z) - m(z,\theta_0,\alpha_0)$ accounting for the presence of $\hat\alpha$. When $\hat\alpha$ is nonparametric the
results of Newey (1994) can be used to derive $u_m(z)$. If equation (7) holds along with
other regularity conditions then for $H_m = \partial E[m(z,\theta,\alpha_0)]/\partial\theta|_{\theta=\theta_0}$,

$$ \sqrt{n}(\hat\theta - \theta_0) \xrightarrow{d} N\bigl(0,\ H_m^{-1}E[u_m(z)u_m(z)']H_m^{-1\prime}\bigr). \qquad (8) $$

The efficient two-step estimator we will consider is one where $m$ minimizes the
asymptotic variance $H_m^{-1}E[u_m(z)u_m(z)']H_m^{-1\prime}$.

A sufficient condition for $\bar m(z,\theta,\alpha)$ to minimize the asymptotic variance of $\hat\theta$ is
that for all $m$,

$$ H_m = E[u_m(z)u_{\bar m}(z)']. \qquad (9) $$

When this equation holds then by Newey and McFadden (1994), a lower bound on the
asymptotic variance will be $(E[u_{\bar m}(z)u_{\bar m}(z)'])^{-1}$, and will be attained when $m = \bar m$.

Equation  (9)   is  analogous  to  the  generalized  information  matrix  equality  in  parametric 


models,   and  similar  equations  have  been  used  by  Hansen  (1985a),   Hansen,   Heaton,   and  Ogaki 

(1988),   and  Bates  and  White  (1993)  to  find  efficient  estimators.     Here  we  use  this 

equation  to  derive  the  optimal  choice  of  a  second  step  estimator. 
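
The way equation (9) delivers the lower bound can be sketched with a standard positive
semi-definiteness argument. The symbols $B$ and $\Delta_m$ are introduced here for the sketch
only, and both $H_m$ and $B$ are assumed to be nonsingular.

```latex
\begin{align*}
B &= E[u_{\bar m}(z)u_{\bar m}(z)'], \qquad
\Delta_m = H_m^{-1}u_m(z) - B^{-1}u_{\bar m}(z), \\
E[\Delta_m\Delta_m'] &= H_m^{-1}E[u_m u_m']H_m^{-1\prime}
  - H_m^{-1}E[u_m u_{\bar m}']B^{-1}
  - B^{-1}E[u_{\bar m}u_m']H_m^{-1\prime} + B^{-1} \\
&= H_m^{-1}E[u_m u_m']H_m^{-1\prime} - B^{-1}
  \qquad \text{using } E[u_m u_{\bar m}'] = H_m \text{ from (9)}.
\end{align*}
```

Since the left-hand side is positive semi-definite, the asymptotic variance in (8) is no smaller
than $B^{-1}$ in the matrix sense. Setting $m = \bar m$ in (9) gives $H_{\bar m} = B$, so the variance
in (8) then equals $B^{-1}$.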

This characterization of an efficient two-step estimator can be used to derive the
optimal estimator in the initial example. In that example, $u_m(z) = a(x)\zeta$ and
$H_m = -E[a(x)\sigma(x)]$. Also, the choice of $m$ reduces to a choice of instrument $a$ and,
up to the sign of the instrument (which does not affect the estimator), equation (9) is
$E[a(x)\sigma(x)] = E[a(x)\zeta^2\bar a(x)] = E[a(x)\omega(x)\bar a(x)]$. A solution to this
equation, that is hence an optimal instrument, is $\bar a(x) = \omega(x)^{-1}\sigma(x)$.

Construction of an efficient estimator can often be based on the solution $\bar m(z,\theta,\alpha)$
to equation (9). Although $\bar m$ may depend on unknown functions other than $\alpha$, they
can often be replaced by parametric or nonparametric estimators without affecting the
efficiency of $\hat\theta$, e.g. as in Newey (1993). Estimators which are efficient for some
restricted  class  of  unknown  distributions,   referred  to  as  locally  efficient  here,   can  be 
constructed  by  using  finite  dimensional  parameterizations  of  unknown  components  of  the 
optimal  moment  function.      Estimators  which  are  efficient  for  all  distributions  can  be 
constructed  by  using  nonparametric  methods  to  estimate  unknown  components.     In  the 
examples  to  follow  we  will  discuss  various  estimators  of  the  optimal  moment  functions, 
which  will  result  in  efficiency  under  appropriate  regularity  conditions. 

A general approach to efficient estimation, which is useful when $\bar m$ is complicated
and  it  is  hard  to  form  an  explicit  estimate,   is  to  use  the  efficient  generalized  method 
of  moments  estimator  based  on  "many"  moment  functions.     This  approach  has  been  considered 
by  Beran  (1976),   Hayashi  and  Sims  (1983),   Chamberlain  (1987),   and  Newey  (1993).      Under  a 
"spanning"  condition,   this  approach  will  result  in  an  estimator  that  is  approximately 
efficient,   in  the  sense  that  as  the  number  of  moments  grows,  the  asymptotic  variance  of 
the  estimator  approaches  that  of  the  optimal  estimator. 

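To be precise, consider a $J \times 1$ vector of functions $m_J(z,\theta,\alpha)$ (where $\alpha$ may
depend on $J$). Suppose that for some $J \times 1$ vector $u_J(z)$, equation (7) is satisfied
with $m_J$ and $u_J$ replacing $m$ and $u_m$ respectively, and let $\hat V_J$ denote an estimator
of $V_J = E[u_J(z)u_J(z)']$ (e.g. $\hat V_J = \sum_{i=1}^n \hat u_J(z_i)\hat u_J(z_i)'/n$ for an estimator
$\hat u_J(z)$). An optimal GMM estimator based on the moment vector $m_J$ is

$$ \hat\theta_J = \arg\min_{\theta\in\Theta} \sum_{i=1}^n m_J(z_i,\theta,\hat\alpha)'\,\hat V_J^{-1}\sum_{i=1}^n m_J(z_i,\theta,\hat\alpha). \qquad (10) $$

An alternative one-step version is, for $\hat H_J = \sum_{i=1}^n [\partial m_J(z_i,\tilde\theta,\hat\alpha)/\partial\theta]/n$,

$$ \hat\theta_J = \tilde\theta - (\hat H_J'\hat V_J^{-1}\hat H_J)^{-1}\hat H_J'\hat V_J^{-1}\sum_{i=1}^n m_J(z_i,\tilde\theta,\hat\alpha)/n, \qquad (11) $$

where $\tilde\theta$ is an initial estimator. As usual, the one-step estimator is asymptotically
equivalent to its full optimization counterpart. Both estimators will have asymptotic
variance $(H_J'V_J^{-1}H_J)^{-1}$, which can be estimated by $(\hat H_J'\hat V_J^{-1}\hat H_J)^{-1}$.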
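
The one-step update in equation (11) can be sketched in a few lines for the simplest case. The
following is an illustration only, under assumptions that are not in the original: $\theta$ is a
scalar, the moments are linear, $m_J = A_J(x)(y - \theta s)$ with `s` playing the role of a
generated regressor, and the weighting matrix is built from the moment functions themselves,
that is, without the first-step correction term that $u_J(z)$ would generally include. The
helper names `solve` and `one_step_gmm` are hypothetical.

```python
def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting (small square M)."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(M, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def one_step_gmm(theta_init, A, s, y):
    """One-step update for scalar theta with moments A_J(x_i) * (y_i - theta * s_i)."""
    n, J = len(y), len(A[0])
    resid = [yi - theta_init * si for yi, si in zip(y, s)]
    # Average moments at the initial estimate.
    g = [sum(A[i][j] * resid[i] for i in range(n)) / n for j in range(J)]
    # Average derivative of the moments with respect to theta.
    H = [-sum(A[i][j] * s[i] for i in range(n)) / n for j in range(J)]
    # Outer-product estimate of V_J, ignoring any first-step correction (an assumption).
    V = [[sum(A[i][j] * A[i][k] * resid[i] ** 2 for i in range(n)) / n
          for k in range(J)] for j in range(J)]
    Vinv_H = solve(V, H)
    Vinv_g = solve(V, g)
    HVH = sum(h * v for h, v in zip(H, Vinv_H))
    HVg = sum(h * v for h, v in zip(H, Vinv_g))
    return theta_init - HVg / HVH


# Noiseless toy data with true theta = 2: for linear moments the one-step update
# recovers the solution from any initial value with nonzero residuals.
xs = [0.1 * i for i in range(10)]
s = [1.0 + xi * xi for xi in xs]
y = [2.0 * si for si in s]
A = [[1.0, xi] for xi in xs]
theta_one_step = one_step_gmm(1.5, A, s, y)
```

With moments that are nonlinear in $\theta$ the update is only a single Newton-type step from
$\tilde\theta$, which is why a root-n consistent initial estimator is needed.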

As $J$ gets larger the asymptotic variance of this estimator will approach the lower
bound, if linear combinations of $u_J(z)$ can approximate the optimal influence function
$u_{\bar m}(z)$ in mean square, as shown by the following result.

Theorem 2.1: Suppose that there is $\bar m$ such that $E[u_{\bar m}(z)u_{\bar m}(z)']$ is nonsingular,
$H_m = E[u_m(z)u_{\bar m}(z)']$ for all feasible $m$, and there are conformable constant matrices
$C_J$, $J = 1, 2, \ldots$, such that $E[\|u_{\bar m}(z) - C_Ju_J(z)\|^2] \to 0$ as $J \to \infty$. Then
$(H_J'V_J^{-1}H_J)^{-1} \to (E[u_{\bar m}(z)u_{\bar m}(z)'])^{-1}$ as $J \to \infty$.
m         m 


The mean-square approximation hypothesis of this result is the spanning condition
referred to above. This result falls short of an efficient estimation result, because it
does not specify a way, independent of the true data generating process, for $J$ to
grow with the sample size so that $\hat\theta_{J(n)}$ has asymptotic variance $(E[u_{\bar m}(z)u_{\bar m}(z)'])^{-1}$.
It is possible to give such rates in particular problems, as in Newey (1993), but
to avoid technical detail rates are not derived here. Instead, we focus on how this
result suggests efficient estimation might approximately be achieved, by choosing moment
functions with corresponding $u_J(z)$ that approximate $u_{\bar m}(z)$ in mean-square.


3.        Conditional  Moment  Restrictions  and  Nonparametric  Generated  Regressors. 

The first specific model we consider is a semiparametric instrumental variables
estimator that depends on a nonparametric regression. The introductory example is a
special case of this model. Also, this case includes estimators that have been
considered by Ahn and Manski (1993) and Rilstone (1989). Let $a(x)$ denote a vector of
instrumental variables and $\rho(z,\theta,\alpha)$ a residual that depends on a function $\alpha$, where

$$ E[\rho(z,\theta_0,\alpha_0) \mid x] = 0, \qquad \alpha_0(w) = E[d \mid w]. \qquad (12) $$


Then the moment vectors will consist of a vector of instrumental variables $a(x)$
multiplying the residual,

$$ m(z,\theta,\alpha) = a(x)\rho(z,\theta,\alpha). $$

The  optimality  problem  here  is  finding  the  set  of  instrumental  variables  that  minimizes 
the  asymptotic  variance. 

To derive the optimal instruments we need to account for the nonparametric
estimation, which can be done by imposing the following condition. Let $\alpha(w,\gamma)$
denote a parametric specification for the conditional expectation, satisfying regularity
conditions, along with the conditional moment vector, so that the derivatives in the
following condition exist.

Assumption 3.1: There is $\delta(w)$ such that for all $\alpha(w,\gamma)$ with $\alpha_0(w) = \alpha(w,\gamma_0)$ for
some $\gamma_0$, $\partial E[a(x)\rho(z,\theta_0,\alpha(\gamma))]/\partial\gamma|_{\gamma=\gamma_0} = E[a(x)\delta(w)\,\partial\alpha(w,\gamma_0)/\partial\gamma]$.

This condition leads to correction terms for estimation of $\alpha_0$ of the simple form
$E[a(x) \mid w]\delta(w)[d - \alpha_0(w)]$, as derived in Newey (1994). The value of this approach is
that it is based on a simple derivative calculation that is easy to apply. For instance,
in the initial example, where $\alpha_0(x) = E[(y^2, y)' \mid x]$ and
$\rho(z,\theta,\alpha) = y - \beta\{\alpha_1(x) - [\alpha_2(x)]^2\}^{1/2}$, Assumption 3.1 is satisfied with
$\delta(w) = -\beta_0[2\sigma_0(x)]^{-1}(1, -2E[y \mid x])$, leading to the correction term given earlier.

The first optimal instrument question we address is for the case where the
instruments $x$ are a subset of the first stage regressors. Let
$D(x) = E[\partial\rho(z,\theta_0,\alpha_0)/\partial\theta \mid x]$, and for now assume that $\rho$ is a scalar.

Theorem 3.1: If $x \subseteq w$ and Assumption 3.1 is satisfied, then

$$ u_m(z) = a(x)\zeta, \qquad \zeta = \rho(z,\theta_0,\alpha_0) + \delta(w)[d - \alpha_0(w)], $$

and the choice of instruments that minimizes the asymptotic variance is

$$ \bar a(x) = D(x)'(E[\zeta^2 \mid x])^{-1}. $$

An interesting interpretation of these optimal instruments follows upon noting the form
of the influence function, $u_m(z) = a(x)\zeta$, which is the instrumental variables times an
"adjusted residual" $\zeta$ that accounts for the presence of $\hat\alpha$. Consequently, the
optimal instruments are the same as for an IV estimator without the first stage,
except that the conditional variance $E[\zeta^2 \mid x]$ has replaced $E[\rho^2 \mid x]$. The reason that
the optimal instruments have this simple form is that the adjusted residual fully
accounts for the generated regressors.

The initial example provides one illustration of this case. Another example is the
estimator of Ahn and Manski (1993). Suppose there is a binary dependent variable
$y \in \{0, 1\}$, with

$$ \mathrm{Prob}(y=1 \mid x) = \Phi(\theta_{10}[\alpha_0(x,0) - \alpha_0(x,1)] + x'\theta_{20}), \quad \alpha_0(w) = E[d \mid w], \quad w = (x,y), \qquad (13) $$

where $\Phi$ is the CDF for a standard normal. Their estimator is probit, with a
nonparametric estimator $\hat\alpha(w)$ replacing $\alpha_0(w)$. As usual for probit, this estimator is
asymptotically equivalent to an instrumental variables estimator with

$$ \rho(z,\theta,\alpha) = y - \Phi(\theta_1[\alpha(x,0) - \alpha(x,1)] + x'\theta_2), $$

$$ a(x) = D(x)\Omega(x)^{-1}, \quad \Omega(x) = \Phi(v)[1 - \Phi(v)], \quad D(x) = -(\alpha_0(x,0) - \alpha_0(x,1),\ x')'\phi(v), \qquad (14) $$

where $v = \theta_{10}[\alpha_0(x,0) - \alpha_0(x,1)] + x'\theta_{20}$ and $\phi(v)$ is the standard normal p.d.f. These
instruments are not optimal.

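To derive the optimal instruments, note that $x \subseteq w$, so that we can apply Theorem
3.1. To do so, we note that for any functions $b(x)$ and $d(w)$,
$E[b(x)\{d(x,0) - d(x,1)\}] = E[b(x)d(w)\{[(1-y)/(1-\Phi(v))] - [y/\Phi(v)]\}]$. Then, by
$\partial\rho(z,\theta_0,\alpha(\gamma))/\partial\gamma|_{\gamma=\gamma_0} = -\theta_{10}\phi(v)[\partial\alpha(x,0,\gamma)/\partial\gamma - \partial\alpha(x,1,\gamma)/\partial\gamma]$,
Assumption 3.1 is satisfied for
$\delta(w) = -\theta_{10}\phi(v)\{[1-\Phi(v)]^{-1}(1-y) - \Phi(v)^{-1}y\}$. Then, as in Theorem 3.1,
$\zeta = y - \Phi(v) - \theta_{10}\phi(v)\{[1-\Phi(v)]^{-1}(1-y) - \Phi(v)^{-1}y\}[d - \alpha_0(w)]$,
so for $\omega(x) = E[\{d - E[d \mid w]\}^2 \mid x]$, the optimal instruments are

$$ \bar a(x) = (E[\zeta^2 \mid x])^{-1}D(x), \qquad (15) $$

$$ E[\zeta^2 \mid x] = \Phi(v)[1-\Phi(v)] + \phi(v)^2\theta_{10}^2\Phi(v)^{-1}[1-\Phi(v)]^{-1}\omega(x). $$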


As  in  the  last  example,   the  optimal  instruments  are  those  for  weighted  nonlinear  least 
squares,   where  the  weight   is  the  inverse  of  the  conditional  variance  of  an  "adjusted 
residual,"  rather  than  the  original  residual. 
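
The conditional variance of the adjusted residual in the probit example can be traced through
term by term. The following is a sketch of one route to the displayed formula, assuming (this
is an assumption of the sketch, not a statement from the original) that $d$ is a scalar and that
the conditional variance of $d$ given $w = (x,y)$ does not vary with $y$, so that it equals
$\omega(x)$.

```latex
\begin{align*}
E[\zeta^2 \mid x]
 &= E[\rho^2 \mid x] + 2E\bigl[\rho\,\delta(w)\{d - \alpha_0(w)\} \mid x\bigr]
    + E\bigl[\delta(w)^2\{d - \alpha_0(w)\}^2 \mid x\bigr] \\
 &= \Phi(v)[1-\Phi(v)] + 0
    + \theta_{10}^2\phi(v)^2\Bigl\{\frac{1}{1-\Phi(v)} + \frac{1}{\Phi(v)}\Bigr\}\omega(x) \\
 &= \Phi(v)[1-\Phi(v)] + \frac{\theta_{10}^2\phi(v)^2}{\Phi(v)[1-\Phi(v)]}\,\omega(x).
\end{align*}
```

The cross term vanishes because $\rho$ and $\delta(w)$ are functions of $w$ and
$E[d - \alpha_0(w) \mid w] = 0$. The last term uses that $y$ is binary, so that
$\delta(w)^2 = \theta_{10}^2\phi(v)^2\{(1-y)/[1-\Phi(v)]^2 + y/\Phi(v)^2\}$, together with
$E[y \mid x] = \Phi(v)$.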

An optimal estimator can be constructed by using estimates of the optimal
instruments. As usual, estimating instruments will not affect the asymptotic variance.
The optimal instruments may be estimated by substituting $\hat\theta$, $\hat\alpha(x)$, and $\hat v$ for $\theta_0$,
$\alpha_0(x)$, and $v$ respectively in the formula for the optimal instruments, and also
substituting an estimator for $\omega(x)$. A locally efficient estimator can be constructed by
estimating $\omega(x)$ as the predicted value from a regression of nonparametric squared
residuals $[d - \hat\alpha(w)]^2$ on a few functions of $x$. A fully efficient estimator would
require nonparametric estimation of $\omega(x)$. Alternatively, an approximately efficient
estimator could be constructed from GMM estimation using many functions of $x$, as
considered below.
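
The locally efficient construction of $\hat\omega(x)$ described above can be sketched as follows.
This is a minimal illustration that assumes a scalar $x$, a scalar $d$, and the two functions
$(1, x)$ as the "few functions of $x$"; the names `fit_line` and `omega_hat` and the positivity
floor are choices made for this example and do not come from the original.

```python
def fit_line(x, r2):
    """Least squares regression of r2 on a constant and x; returns (intercept, slope)."""
    n = len(x)
    xbar = sum(x) / n
    rbar = sum(r2) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxr = sum((xi - xbar) * (ri - rbar) for xi, ri in zip(x, r2))
    slope = sxr / sxx
    return rbar - slope * xbar, slope


def omega_hat(x, d, alpha_hat, floor=1e-6):
    """Predicted values from regressing [d - alpha_hat(w)]^2 on (1, x).

    The floor keeps the fitted conditional variance positive so that it can be
    inverted when forming weights; it is an ad hoc safeguard for this sketch.
    """
    r2 = [(di - ai) ** 2 for di, ai in zip(d, alpha_hat)]
    b0, b1 = fit_line(x, r2)
    return [max(b0 + b1 * xi, floor) for xi in x]
```

The fitted values would then be substituted for $\omega(x)$ in the formula for $E[\zeta^2 \mid x]$.
Efficiency is attained only if the true $\omega(x)$ lies in the span of the chosen functions,
which is the sense in which the estimator is locally efficient.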

Another example of Theorem 3.1 is the semiparametric panel probit estimator of Newey
(1994), where $y_t$, $(t = 1, 2)$, are binary variables and there is an unknown function
$h(x)$ such that

$$ E[y_t \mid x] = \Phi([x_t'\beta_0 + h(x)]/\sigma_t), \quad (t = 1, 2), \quad \sigma_2 = 1. $$

Inverting the normal CDF and differencing eliminates the unknown function $h(x)$ and
gives the condition $\rho(x,\theta_0,\alpha_0) = 0$, where
$\alpha_0(x) = (\alpha_{20}(x), \alpha_{10}(x))' = (E[y_2 \mid x], E[y_1 \mid x])'$, $\theta = (\beta', \sigma_1)'$, and

$$ \rho(x,\theta,\alpha) = \Phi^{-1}(\alpha_2(x)) - \sigma_1\Phi^{-1}(\alpha_1(x)) - (x_2 - x_1)'\beta. $$


The least squares estimator in Newey (1994) is obtained by substituting a nonparametric
regression estimator $\hat\alpha(x)$ for $\alpha_0(x)$ and minimizing $\sum_{i=1}^n \rho(x_i,\theta,\hat\alpha)^2$. As usual for
least squares, this is asymptotically equivalent to an IV estimator with instruments
$a(x) = D(x)' = \partial\rho(x,\theta_0,\alpha_0)/\partial\theta = -((x_2 - x_1)', \Phi^{-1}(\alpha_{10}(x)))'$.

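The optimal instruments can be derived by applying Theorem 3.1. Let
$\psi(a) = \phi(\Phi^{-1}(a))^{-1}$. In this example $\delta(x) = (\psi(\alpha_{20}(x)), -\sigma_{10}\psi(\alpha_{10}(x)))$.
Therefore, from Theorem 3.1 we have $\zeta = \delta(x)\{y - E[y \mid x]\}$ for $y = (y_2, y_1)'$. Let
$\omega(x) = \delta(x)\mathrm{Var}(y \mid x)\delta(x)' = \mathrm{Var}(\zeta \mid x)$. Then the optimal instruments are

$$ \bar a(x) = D(x)'\omega(x)^{-1} = -\begin{pmatrix} x_2 - x_1 \\ \Phi^{-1}(\alpha_{10}(x)) \end{pmatrix} \Big/ \mathrm{Var}\bigl(\psi(\alpha_{20}(x))y_2 - \sigma_{10}\psi(\alpha_{10}(x))y_1 \mid x\bigr). $$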


These instruments correspond to the first order condition for the weighted least squares
estimator

$$ \hat\theta = \arg\min_\theta \sum_{i=1}^n \hat\omega(x_i)^{-1}\rho(x_i,\theta,\hat\alpha)^2/n. $$

This estimator can also be thought of as a semiparametric minimum distance
estimator, where $\theta$ is being chosen to minimize a function that should converge to zero
at the truth. The characterization of an optimal IV estimator applies to any
semiparametric minimum distance problem where $\alpha_0(x) = E[y \mid x]$ for a vector $y$ and
$\rho(x,\theta_0,\alpha_0) = 0$. For $D(x) = \partial\rho(x,\theta_0,\alpha_0)/\partial\theta$ and $\omega(x) = \delta(x)\mathrm{Var}(y \mid x)\delta(x)'$, the
optimal instruments will be $D(x)'\omega(x)^{-1}$. Furthermore, the weighted least squares estimator
will be optimal in this class.

Construction of an efficient estimator is straightforward in the case where $x \subseteq w$.
The optimal instruments in Theorem 3.1 can be estimated nonparametrically, proceeding
analogously to Newey (1993). Alternatively, an approximately efficient estimator can be
constructed by GMM estimation using a vector of approximating functions $A(x)$ as
instruments, where the moment functions would be $A(x)\rho(z,\theta,\alpha)$. In this case, where
the influence function is so simple, the spanning condition of Theorem 2.1 just requires
that a linear combination of $A(x)$ can approximate the optimal instruments. For brevity
we omit a formal result.

The next case we consider is that where $w \subseteq x$. This case is more complicated in
that the correction for the first stage estimation does not lead to the adjusted residual
form of the influence function. Let $\rho = \rho(z,\theta_0,\alpha_0)$, $\Omega(x) = E[\rho\rho' \mid x]$,
$V = \delta(w)[d - \alpha_0(w)]$, $\Sigma(w) = E[VV' \mid w]$, and $K(x) = E[\rho V' \mid x]$.




Theorem 3.2: If $w \subseteq x$ and Assumption 3.1 is satisfied, then

$$ u_m(z) = a(x)\rho + E[a(x) \mid w]V. $$

Also, if the linear equations

$$ \bar a(x) = [D(x)' + P(w)' + R(w)'K(x)']\Omega(x)^{-1}, $$

$$ P(w) = -\Sigma(w)E[\bar a(x)' \mid w] - E[K(x)'\bar a(x)' \mid w], \qquad R(w) = -E[\bar a(x)' \mid w], $$

have a solution for $\bar a(x)$, $R(w)$, and $P(w)$ with probability one, then the optimal
instruments are $\bar a(x)$. Furthermore, if $K(x) = 0$ then $R(w) = 0$ and

$$ P(w) = -\{\Sigma(w)^{-1} + E[\Omega(x)^{-1} \mid w]\}^{-1}E[\Omega(x)^{-1}D(x) \mid w]. $$

In  general  the  form  of  the  optimal  instruments  is  quite  complicated,   although  it 
simplifies  in  the  zero  conditional  covariance  case,     K(x)  =  0. 
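
The closed form for $P(w)$ in the zero conditional covariance case can be obtained by
substitution. The following sketch assumes that $\Omega(x)$ is symmetric and that $\Sigma(w)$ is
nonsingular; it only rearranges the linear equations of Theorem 3.2 with $K(x) = 0$.

```latex
\begin{align*}
\bar a(x)' &= \Omega(x)^{-1}\{D(x) + P(w)\}, \\
P(w) &= -\Sigma(w)E[\bar a(x)' \mid w]
      = -\Sigma(w)E[\Omega(x)^{-1}D(x) \mid w] - \Sigma(w)E[\Omega(x)^{-1} \mid w]P(w), \\
\{\Sigma(w)^{-1} + E[\Omega(x)^{-1} \mid w]\}P(w) &= -E[\Omega(x)^{-1}D(x) \mid w].
\end{align*}
```

The second line uses that $P(w)$ is a function of $w$ and so can be taken outside the
conditional expectation; the third line follows after premultiplying by $\Sigma(w)^{-1}$ and
collecting the terms in $P(w)$.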

An example where Theorem 3.2 applies is a nonparametric generated regressor model
where

$$ \rho(z,\theta,\alpha) = y - z_1'\theta_1 - \gamma\alpha(w) - \delta[d - \alpha(w)], \qquad \alpha_0(w) = E[d \mid w], \qquad (d,w) \subseteq x. $$

This residual is that for a linear model, where a conditional expectation and the
residual from the same conditional expectation are included as regressors. The model is
a semiparametric version of a familiar model with many economic applications, that has
been considered in Rilstone (1989). Note here that $D(x) = -(E[z_1' \mid x], \alpha_0(w), d - \alpha_0(w))$,
$K(x) = 0$ and that Assumption 3.1 is satisfied for $\delta(w) = \delta_0 - \gamma_0$. Therefore, the optimal
instruments are

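$$ \bar a(x) = D(x)'\Omega(x)^{-1} - E[D(x)'\Omega(x)^{-1} \mid w]\{\Sigma(w)^{-1} + E[\Omega(x)^{-1} \mid w]\}^{-1}\Omega(x)^{-1}. $$

In the case where $\Omega(x)$ and $\Sigma(w)$ are constant, this formula simplifies to

$$ \bar a(x) = -\begin{pmatrix} E[z_1 \mid x] \\ \alpha_0(w) \\ d - \alpha_0(w) \end{pmatrix}\Omega^{-1} + (\Sigma^{-1} + \Omega^{-1})^{-1}\begin{pmatrix} E[z_1 \mid w] \\ \alpha_0(w) \\ 0 \end{pmatrix}\Omega^{-2}, $$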


and it can be shown that the resulting estimator attains Rilstone's (1989) semiparametric
bound for normal disturbances. It is interesting to note that these instruments are
equal to those for the best estimator if $\alpha_0(w)$ were used in place of $\hat\alpha$, plus a term
involving the additional variable $E[z_1 \mid w]$. If $E[z_1 \mid w] = E[z_1 \mid x]$, e.g. as would occur
if $z_1 \subseteq w$, then the best instruments would be a linear combination of the instruments
that are best without the generated regressor problem. Indeed, if $z_1 \subseteq w$, then it
follows that least squares is best.

In the constant $\Omega(x)$ and $\Sigma(w)$ case, an estimator of the optimal instruments is
readily available from replacing $E[z_1 \mid x]$, $E[z_1 \mid w]$, $\alpha_0(w)$, $\Sigma$, and $\Omega$ by
corresponding estimates. Alternatively, since the optimal instruments are a linear
combination of $A(x) = (E[z_1' \mid x], E[z_1' \mid w], \alpha_0(w), d - \alpha_0(w))'$, for the corresponding
$\hat A(x)$ an optimal estimator could be obtained from GMM with an optimal weighting matrix that
accounts for the generated regressors.

In the general model of Theorem 3.2 it should be possible to construct an efficient
estimator by using nonparametric estimates of the optimal instruments. Alternatively, an
approximately efficient estimator can be constructed using many moment conditions. Here
that corresponds to GMM estimation with many functions of $x$ as instruments. It is
straightforward to give conditions for approximate efficiency of these estimators. Let
$A_J(x) = (A_{1J}(x), \ldots, A_{JJ}(x))'$ be a vector of functions of $x$, and consider a GMM
estimator as in equation (10) or (11) with $m_J(z,\theta,\alpha) = \rho(z,\theta,\alpha) \otimes A_J(x)$. Let $r$ be the
dimension of $\rho$.




Theorem 3.3: If $E[u_{\bar m}(z)u_{\bar m}(z)']$ is nonsingular, $E[\|\bar a(x)\|^2] < \infty$, $\Omega(x)$ and $\Sigma(w)$ are
bounded, and for any $a(x)$ with $E[\|a(x)\|^2] < \infty$ there exists $C_J$ such that
$E[\|a(x) - C_J\{I_r \otimes A_J(x)\}\|^2] \to 0$ as $J \to \infty$, then
$(H_J'V_J^{-1}H_J)^{-1} \to (E[u_{\bar m}(z)u_{\bar m}(z)'])^{-1}$ as $J \to \infty$.


The  point  of  this  result  is  that  all  that  is  needed  to  guarantee  approximation  of  the 
complicated  optimal   influence  function  in  Theorem  3.2,   leading  to  approximate  efficiency, 
is  that  the  instruments  can  approximate  functions  of     x.     There  are  many  types  of 
instruments  that  would  meet  this  qualification,   including  power  series  and  regression 
splines,   making  it  relatively  easy  to  construct  an  approximately  efficient  estimator. 
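
As an illustration of the kind of approximating functions mentioned above, the following sketch
builds a power series $A_J(x) = (1, x, \ldots, x^{J-1})'$ for a scalar $x$ and forms the moment
vector $\rho \otimes A_J(x)$. The function names are hypothetical, and the restriction to scalar
$x$ and raw (unorthogonalized) powers is a simplification made for this example.

```python
def power_series(x, J):
    """Approximating functions A_J(x) = (1, x, ..., x^(J-1))' for scalar x."""
    return [x ** j for j in range(J)]


def kron_moments(rho, A):
    """Kronecker product rho (x) A: each residual component times every element of A."""
    return [r * a for r in rho for a in A]


def average_moments(rhos, xs, J):
    """Sample average of rho_i (x) A_J(x_i) over the observations."""
    n = len(xs)
    rows = [kron_moments(rho, power_series(x, J)) for rho, x in zip(rhos, xs)]
    return [sum(row[k] for row in rows) / n for k in range(len(rows[0]))]
```

For example, `average_moments([[1.0], [3.0]], [0.0, 2.0], 2)` averages the two vectors
`[1.0, 0.0]` and `[3.0, 6.0]`. In practice raw powers become nearly collinear as $J$ grows, so
orthogonal polynomials or splines may be numerically preferable.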

Here  we  have  shown  the  form  of  an  efficient  two-step  GMM  estimator  when  the  second 
step  instruments  are  a  subset  of  the  first  step  regressors  or  vice  versa.     It  is  also 
possible  to  obtain  some  results  for  the  case  where     w     and     x     are  not  a  subset  of  each 
other.     For  brevity  these  results  are  omitted. 


4.        Sample  Selection  Models  and  Nonparametric  Propensity  Score  Estimation 

Interesting  and  important  two-step  estimators  arise  in  the  context  of  sample 
selection  models.      A  general  form  of  such  a  model   is 

$$ y^* = x'\theta_0 + \varepsilon, \qquad y^* \text{ only observed if } d = 1, \qquad d \in \{0, 1\}, \qquad (16) $$

$$ \mathrm{Prob}(d = 1 \mid w) = \pi(w) = P, \qquad x \subseteq w. $$

The selection probability $P$ is referred to as the propensity score. The parameters of
interest $\theta_0$ are identified under various restrictions on the joint distribution of the
selection indicator $d$ and the disturbance $\varepsilon$. In this section we consider the form of
optimal two-step estimators in two cases, where $\varepsilon$ and $w$ are mean independent
conditional on $d = 1$ and $P$ and where they are statistically independent. The
estimators are two-step estimators where the first step is nonparametric estimation of
the propensity score. Some such estimators have been previously considered by Ahn and
Powell (1993) and Choi (1990).

The  first  case  we  consider  is 

$$ E[\varepsilon \mid w, d=1] = E[\varepsilon \mid P, d=1]. \qquad (17) $$

Here the conditional mean of the disturbance, given selection and $w$, depends only on
the propensity score. This can be motivated by a latent variable model where
$d = 1(\tau(w) + \eta \ge 0)$ for some unknown function $\tau(w)$ and a disturbance $\eta$. Suppose that $P$
is a one-to-one function of $\tau(w)$ (e.g. $\eta$ and $w$ are independent and $\eta$ has density
that is positive everywhere) and that $E[\varepsilon \mid \eta, w] = E[\varepsilon \mid \eta, \tau(w)]$. Then
$E[\varepsilon \mid w, d=1] = E[\varepsilon \mid w, \eta \ge -\tau(w)] = E[E[\varepsilon \mid \eta, w] \mid w, \eta \ge -\tau(w)] = E[E[\varepsilon \mid \eta, \tau(w)] \mid w, \eta \ge -\tau(w)]$,
which is a function of $w$ only through $\tau(w)$, and hence through $P$, so that equation (17)
holds.

A two step estimator of $\theta_0$ can be constructed using a nonparametric propensity
score estimator $\hat P = \hat\pi(w)$. A vector of instrumental variables $a(w)$ can be used
along with a residual $d\{y - \hat E[y \mid \hat P, d=1] - (x - \hat E[x \mid \hat P, d=1])'\theta\}$,
where $\hat E[\cdot \mid \hat P, d=1]$ is a nonparametric regression estimator with regressor
$\hat P$, to form a moment vector

$$m(z,\theta,\alpha) = a(w)\,d\{y - \alpha_y(\alpha_\pi(w)) - [x - \alpha_x(\alpha_\pi(w))]'\theta\}, \qquad \alpha = (\alpha_y, \alpha_x, \alpha_\pi). \qquad (18)$$

Here $\alpha_y$ and $\alpha_x$ have true values $E[y \mid \alpha_\pi(w), d=1]$ and
$E[x \mid \alpha_\pi(w), d=1]$, and $\alpha_\pi(w)$ has true value $\pi(w)$. Using this function
with $\alpha$ as we have described is a two-step instrumental variables version of
Robinson's (1988) estimator, and a density weighted version has been developed by Ahn and
Powell (1993).
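The construction in equation (18) can be illustrated with a small simulation. The sketch below is only one possible implementation, under assumptions that are not in the text: both nonparametric steps use power series fitted by least squares, the first-step fit is clipped to $[0.01, 0.99]$, the instruments are $a(w) = x$, and the data-generating design (sample size, coefficients, normal disturbances) is invented for illustration. The function names are hypothetical.

```python
import numpy as np

def poly_basis_2d(w, degree):
    """Bivariate power series in the two columns of w, total degree <= degree."""
    cols = [w[:, 0] ** i * w[:, 1] ** j
            for i in range(degree + 1) for j in range(degree + 1 - i)]
    return np.column_stack(cols)

def series_fit(basis, target):
    """Least-squares series regression; returns fitted values at the sample points."""
    coef, *_ = np.linalg.lstsq(basis, target, rcond=None)
    return basis @ coef

def two_step_iv(y, x, w, d, a, deg_first=3, deg_second=4):
    """Two-step IV estimate of theta in y = x'theta + e, with y observed only if d == 1."""
    # First step (assumed series estimator): P_hat = fitted E[d | w], clipped into (0, 1).
    p_hat = np.clip(series_fit(poly_basis_2d(w, deg_first), d.astype(float)), 0.01, 0.99)
    sel = d == 1
    # Second step: partial a power series in P_hat out of y and x in the selected sample.
    basis_p = np.vander(p_hat[sel], deg_second + 1, increasing=True)
    y_res = y[sel] - series_fit(basis_p, y[sel])
    x_res = x[sel] - series_fit(basis_p, x[sel])
    # Solve the sample moment condition sum_i a_i d_i (y_res_i - x_res_i' theta) = 0.
    theta = np.linalg.solve(a[sel].T @ x_res, a[sel].T @ y_res)
    return theta, p_hat

# Simulated design; every number here is an illustrative assumption.
rng = np.random.default_rng(0)
n = 5000
w = rng.standard_normal((n, 2))
eta = rng.standard_normal(n)
d = (0.5 * w[:, 0] + w[:, 1] + eta >= 0).astype(int)   # d = 1(tau(w) + eta >= 0)
eps = 0.5 * eta + 0.5 * rng.standard_normal(n)          # E[eps | eta, w] depends on eta only
x = w[:, [0]]                                           # regressor is one component of w
y = np.where(d == 1, 1.0 * x[:, 0] + eps, np.nan)       # true theta_0 = 1; y missing if d == 0
theta_hat, p_hat = two_step_iv(y, x, w, d, a=x)
print(theta_hat)
```

The second component of `w` is excluded from `x`, which gives `x` variation that is not a function of the propensity score alone; without such variation the residual `x_res` would be degenerate.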

It is straightforward to derive the influence function from the results of Newey
(1994) and to obtain the optimal instruments. Let $\lambda(P) = E[\varepsilon \mid w, d=1]$ and
$\lambda_P(P) = \partial\lambda(P)/\partial P$.


Theorem 4.1: The influence function is

$$u_m(z) = \{a(w) - E[a(w) \mid P]\}\zeta, \qquad \zeta = d[\varepsilon - \lambda(P)] + P\lambda_P(P)(d - P).$$

For $\eta(w)^2 = E[\zeta^2 \mid w] = E[d\{\varepsilon - \lambda(P)\}^2 \mid w] + \lambda_P(P)^2 P^3(1-P)$
bounded away from zero, the optimal instruments are

$$a(w) = P\,\eta(w)^{-2}\{x - E[\eta(w)^{-2}x \mid P]/E[\eta(w)^{-2} \mid P]\}.$$

Furthermore, $\mathrm{Var}(u_{\bar m}(z))$ is the semiparametric variance bound for estimation
of $\theta_0$ in the model of equation (17).
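The expression for $\eta(w)^2$ in Theorem 4.1 can be checked directly from the form of $\zeta$. The following worked step is a sketch that uses only $d^2 = d$, $E[d \mid w] = P$, and equation (17); it is not taken from the original text.

```latex
\begin{aligned}
\zeta^2 &= d\{\varepsilon-\lambda(P)\}^2
          + 2P\lambda_P(P)\,d(d-P)\{\varepsilon-\lambda(P)\}
          + P^2\lambda_P(P)^2(d-P)^2, \\
E[d(d-P)\{\varepsilon-\lambda(P)\}\mid w]
        &= P(1-P)\,E[\varepsilon-\lambda(P)\mid w,d=1] = 0
          \quad\text{by equation (17)}, \\
E[(d-P)^2\mid w] &= P(1-P), \\
\eta(w)^2 = E[\zeta^2\mid w]
        &= E[d\{\varepsilon-\lambda(P)\}^2\mid w] + \lambda_P(P)^2P^3(1-P).
\end{aligned}
```

The cross term drops out because the selected-sample disturbance has conditional mean $\lambda(P)$ given $w$, so only the variance of the first-step error $d - P$ adds to the weight.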


The best instruments here are like those that appear in an efficient estimator for a
heteroskedastic partially linear model as discussed in Chamberlain (1992). They are
obtained from partialling out an unknown function of $P$ in a weighted least squares
criterion, where the weight is the inverse of the conditional variance of $\zeta$. The main
difference with Chamberlain (1992) is that the presence of $\hat P$ leads to the weight being
the inverse conditional variance of the adjusted residual $\zeta$, rather than the original
residual $\rho$. In this way the optimal instruments account for the presence of the first
stage nonparametric estimates.

Because the optimal instruments attain the semiparametric efficiency bound we do not
need to search beyond instrumental variables estimation for an efficient estimator.

Construction of an optimal estimator, or approximately optimal estimator, could be carried
out similarly to the way discussed in Section 2. A nonparametric estimator
$\hat\eta(w)^2 = \hat E[d\{y - x'\hat\theta - \hat\lambda(\hat P)\}^2 \mid w] + \hat\lambda_P(\hat P)^2\hat P^3(1-\hat P)$
could be constructed using $\hat\theta$ and $\hat\lambda$ from an initial unweighted least
squares estimator of a partially linear model $y = x'\theta + \lambda(P) + r$ in the selected
data, and then a nonparametric estimator of the optimal instruments formed as

$$\hat a(w) = \hat P\,\hat\eta(w)^{-2}\{x - \hat E[\hat\eta(w)^{-2}x \mid \hat P]/\hat E[\hat\eta(w)^{-2} \mid \hat P]\}.$$

This is a complicated estimator for which regularity conditions have not yet been
formulated in the literature, but it should lead to efficiency.

An approximately efficient estimator is straightforward to construct here, by using
approximating functions as instruments and then doing optimal GMM that accounts for the
presence of the nonparametric first stage. Let $A_J(w)$ be a vector of approximating
functions and consider a GMM estimator as in equation (10) or (11) with
$m_J(z,\theta,\alpha) = A_J(w)\,d\{y - E[y \mid P, d=1] - (x - E[x \mid P, d=1])'\theta\}$. Then
there is a relatively simple spanning condition for approximate efficiency of the GMM
estimator:

Theorem 4.2: If $E[u_{\bar m}(z)u_{\bar m}(z)']$ is nonsingular, $\eta(w)$ is bounded and for
any $a(w)$ with finite mean-square, there exists $C_J$ such that
$E[\|a(w) - C_J A_J(w)\|^2] \to 0$ as $J \to \infty$, then
$(H_J'V_J^{-1}H_J)^{-1} \to (E[u_{\bar m}(z)u_{\bar m}(z)'])^{-1}$ as $J \to \infty$.

Here we see that it suffices for approximate efficiency that the approximating vector
$A_J(w)$ spans the set of functions with finite mean-square. There are many such functions
that could be used, including splines and power series.

The second sample selection model we consider satisfies equation (16) with

$\varepsilon$ and $w$ are independent conditional on $P$ and $d = 1$. $\qquad$ (19)

Here it is assumed that conditioning on $P$ removes any dependence between $\varepsilon$ and
$w$. This can be motivated by a latent variable model like that above, where
$d = 1(\tau(w) + \eta \geq 0)$. If the joint distribution of $(\varepsilon, \eta)$ given $w$
depends only on $\tau(w)$ then this equation will be satisfied.

A basic implication of the conditional independence in equation (19) is that any
function of $w$ will be uncorrelated with any function of $\varepsilon$ and $P$, conditional
on $P$ and $d = 1$. This allows us to form estimators analogous to those above, where
$y - x'\theta$ is replaced by any function of $y - x'\theta$ and $P$. To be precise, for a
vector of instrumental variables $a(w)$, a function $q(\varepsilon, P)$, and a corresponding
residual $d\{q(y - x'\theta, P) - E[q(y - x'\theta, P) \mid P, d=1]\}$, we can use the moment
function

$$m(z,\theta,\alpha) = a(w)\,d[q(y - x'\theta, \alpha_\pi(w)) - \alpha_q(y - x'\theta, \alpha_\pi(w))], \qquad \alpha = (\alpha_q, \alpha_\pi). \qquad (20)$$

Here $\alpha_q$ and $\alpha_\pi$ have true values
$E[q(y - x'\theta, \alpha_\pi(w)) \mid \alpha_\pi(w), d=1]$ and $\pi(w)$, respectively. Using
this function with $\alpha$ as we have described is a nonlinear version of the estimator
described above for the conditional mean case.

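The next result gives the form of the influence function and the optimal moment
functions for this estimator. Here let $\lambda(P) = E[q(\varepsilon,P) \mid P, d=1]$,
$\lambda_P(P) = \partial\lambda(P)/\partial P$, and let $f(\varepsilon \mid P)$ be the density of
$\varepsilon$ given $d = 1$ and $w$. Also, let
$s_\varepsilon(\varepsilon,P) = \partial\ln f(\varepsilon \mid P)/\partial\varepsilon$ and
$s_P(\varepsilon,P) = \partial\ln f(\varepsilon \mid P)/\partial P$ be the scores for $f$ with
respect to a location parameter and $P$.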

Theorem 4.3: For the model of equation (19) and moment functions as in equation (20),
the influence function is

$$u_m(z) = \{a(w) - E[a(w) \mid P, d=1]\}\zeta, \qquad \zeta = d[q(\varepsilon,P) - \lambda(P)] + P\{E[q_P(\varepsilon,P) \mid P, d=1] - \lambda_P(P)\}(d - P),$$

and the optimal choice of moment function has $a(w) = x$ and

$$q(\varepsilon,P) = s_\varepsilon(\varepsilon,P) - P\,s_P(\varepsilon,P)\,E[s_\varepsilon s_P \mid P, d=1]/\{P^{-1}(1-P)^{-1} + P\,E[s_P^2 \mid P, d=1]\}.$$

Furthermore, $\mathrm{Var}(u_{\bar m}(z))$ is the semiparametric variance bound for estimation
of $\theta_0$.

Using many moment functions in an approximately efficient estimator is useful in
this case, where the best $q(\varepsilon,P)$ is quite complicated, but the optimal instrument
$a(w) = x$ has a known functional form. In this case approximate efficiency can be achieved
with only a two-dimensional approximation of the best $q(\varepsilon,P)$, rather than the
potentially high dimensional approximation of the best instruments $a(w)$ from Theorem
4.1. The low dimensional nature of the approximation means that it should be possible to
attain high efficiency using a relatively small set of moment conditions. Let
$q_J(\varepsilon,P)$ denote a vector of approximating functions, and consider a GMM estimator
as in equation (10) or (11) with
$m_J(z,\theta,\alpha) = d\{q_J(y - x'\theta, P) - E[q_J(y - x'\theta, P) \mid P, d=1]\} \otimes x$.

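Theorem 4.4: If $E[u_{\bar m}(z)u_{\bar m}(z)']$ is nonsingular, $x$ and
$E[(s_P)^2 \mid P, d=1]$ are bounded, and for any $q(\varepsilon,P)$ with
$E[d\,q(\varepsilon,P)^2]$ finite there is $C_J$ such that
$E[d\{q(\varepsilon,P) - C_J q_J(\varepsilon,P)\}^2] \to 0$ as $J \to \infty$, then
$(H_J'V_J^{-1}H_J)^{-1} \to (E[u_{\bar m}(z)u_{\bar m}(z)'])^{-1}$ as $J \to \infty$.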


In practice it may make sense to begin the approximation with functions of $\varepsilon$ and
$P$ that correspond to particular distributions. For example, one could derive the form
of $q(\varepsilon,P)$ when $\varepsilon$ and $\eta$ have a normal distribution. Alternatively, one
could use a few simple approximating functions, like power series in some function of
$\varepsilon$. Because selection is likely to induce some skewness and because normality is not
expected in many econometric applications, using such an estimator for the conditional
independence case of equation (19) can result in substantial efficiency gains over the
linear estimators based on equation (18), as shown in Newey (1991) for a semiparametric
selection probability.
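As a rough illustration of how the moment vector $d\{q_J(y - x'\theta, P) - E[q_J(y - x'\theta, P) \mid P, d=1]\} \otimes x$ might be assembled in a sample, the sketch below takes $q_J$ to be the first three powers of the residual and estimates the conditional expectation by a power series in $P$ fitted on the selected observations. These choices, the function names, and the simulated inputs are all assumptions made for illustration; in particular the uniform draw for `p` merely stands in for a first-step propensity score estimate.

```python
import numpy as np

def cond_mean_given_p(target, p, sel, degree=4):
    """Series estimate of E[target | P, d=1]: fit on selected rows, evaluate on all rows."""
    basis = np.vander(p, degree + 1, increasing=True)
    coef, *_ = np.linalg.lstsq(basis[sel], target[sel], rcond=None)
    return basis @ coef

def moment_matrix(y, x, d, p, theta, powers=(1, 2, 3)):
    """Rows are d * {q_J(e, P) - E[q_J(e, P) | P, d=1]} kron x, with e = y - x'theta."""
    sel = d == 1
    e = np.where(sel, y - x @ theta, 0.0)             # y is unobserved when d == 0
    q_J = np.column_stack([e ** k for k in powers])   # hypothetical basis: powers of e
    centered = np.where(sel[:, None], q_J - cond_mean_given_p(q_J, p, sel), 0.0)
    # Row-wise Kronecker product of the J centered functions with the regressors.
    return np.einsum("ij,ik->ijk", centered, x).reshape(len(y), -1)

# Illustrative inputs only.
rng = np.random.default_rng(2)
n = 1000
x = rng.standard_normal((n, 2))
p = rng.uniform(0.1, 0.9, n)                          # stand-in for a first-step P_hat
d = (rng.uniform(size=n) < p).astype(int)
theta = np.array([1.0, -0.5])
y = np.where(d == 1, x @ theta + rng.standard_normal(n), np.nan)
M = moment_matrix(y, x, d, p, theta)
print(M.shape)
```

Averaging the rows of `M` gives the sample moment vector that a GMM criterion would then weight; the weighting matrix that accounts for the first stage is not constructed here.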
APPENDIX: Proofs of Theorems


Let     C     denote  a  constant  that  is  different  in  different  uses. 

Proof of Theorem 2.1: For a matrix $B$ let $\|B\| = [\mathrm{tr}(B'B)]^{1/2}$. By equation
(9), $H_J = E[u_J u_{\bar m}']$, where we suppress the $z$ argument for notational
convenience, so that $\hat\theta_J$ has asymptotic variance

$$(H_J'V_J^{-1}H_J)^{-1} = (E[u_{\bar m}u_J'](E[u_J u_J'])^{-1}E[u_J u_{\bar m}'])^{-1} = (E[\tilde u_J\tilde u_J'])^{-1}, \qquad \tilde u_J = E[u_{\bar m}u_J'](E[u_J u_J'])^{-1}u_J.$$

For the $C_J$ in the statement of the result, let $u_C = C_J u_J$. Since $\tilde u_J$ is the
multivariate least squares projection of $u_{\bar m}$ on $u_J$, it follows that

$$E[u_C u_C'] \leq E[\tilde u_J\tilde u_J'] \leq E[u_{\bar m}u_{\bar m}']. \qquad (21)$$

Also, by the spanning condition in the statement,

$$\|E[u_C u_C'] - E[u_{\bar m}u_{\bar m}']\| \leq E[\|u_C u_C' - u_{\bar m}u_{\bar m}'\|] \leq E[\|u_C - u_{\bar m}\|^2] + 2(E[\|u_C - u_{\bar m}\|^2])^{1/2}(E[\|u_{\bar m}\|^2])^{1/2} \to 0.$$

Therefore, by equation (21), $E[\tilde u_J\tilde u_J'] \to E[u_{\bar m}u_{\bar m}']$ as
$J \to \infty$, so the conclusion follows by nonsingularity of $E[u_{\bar m}u_{\bar m}']$ and
continuity of the inverse matrix at any nonsingular matrix. QED.
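The projection argument can be seen numerically in a scalar example. The sketch below is an illustration under invented assumptions: the target influence function is $a(w)\zeta$ with a hypothetical $a(w) = e^{w}$, the approximating influence functions are $A_J(w)\zeta$ with $A_J$ a power series, and sample averages replace expectations. It computes the sample analogue of $H_J'V_J^{-1}H_J$, which is the second moment of the least squares projection and so cannot exceed the second moment of the target.

```python
import numpy as np

def projected_second_moment(u_target, u_basis):
    """Sample analogue of H_J' V_J^{-1} H_J with H_J = E[u_J u_target], V_J = E[u_J u_J']."""
    n = len(u_target)
    H = u_basis.T @ u_target / n
    V = u_basis.T @ u_basis / n
    return float(H @ np.linalg.solve(V, H))

rng = np.random.default_rng(1)
n = 20000
w = rng.uniform(-1.0, 1.0, n)
zeta = rng.standard_normal(n)
u_target = np.exp(w) * zeta                 # hypothetical a(w) = exp(w)
full = float(np.mean(u_target ** 2))        # sample second moment of the target

fits = []
for J in range(1, 7):
    A_J = np.vander(w, J, increasing=True)  # power series 1, w, ..., w^(J-1)
    u_J = A_J * zeta[:, None]
    fits.append(projected_second_moment(u_target, u_J))

# The implied variance 1 / fits[J-1] falls toward 1 / full as J grows.
print(fits, full)
```

Because the power series bases are nested, the projected second moment is nondecreasing in $J$, mirroring the way the asymptotic variance of the GMM estimator shrinks toward the bound as more approximating functions are added.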


Proof  of  Theorem  3.1:     The  form  of  the  influence  function  follows  by  Newey  (1994).     The 
optimal  instruments  follow  as  in  Chamberlain  (1987). 

Proof of Theorem 3.2: The form of the influence function follows by Newey (1994). For
notational convenience suppress the $x$ argument of $a(x)$, $\Omega(x)$, $D(x)$, and $K(x)$.
Then by iterated expectations, equation (9) for the optimal influence function, and
hence the optimal instruments, is

$$H_m = E[aD] = E[a\{\Omega\bar a' + Z(w)E[\bar a' \mid w] + K E[\bar a' \mid w] + E[K'\bar a' \mid w]\}].$$

Since $a = a(x)$ can be any bounded function of $x$, satisfaction of this equation
requires that
$D = \Omega\bar a' + Z(w)E[\bar a' \mid w] + K E[\bar a' \mid w] + E[K'\bar a' \mid w] = \Omega\bar a' - P(w) - KR(w)$.
Solving for $\bar a$ then gives the second result. For the third result, let $P_w = P(w)$
and $Z_w = Z(w)$. Applying the optimal instrument formula gives
$\bar a = (D' + P_w')\Omega^{-1}$, $P_w = -Z_w E[\bar a' \mid w]$. Plugging in the formula for
$\bar a$ in the equation for $P_w$ gives
$P_w = -Z_w E[\Omega^{-1}(D + P_w) \mid w] = -Z_w E[\Omega^{-1}D \mid w] - Z_w E[\Omega^{-1} \mid w]P_w$.
Solving for $P_w$ then gives the third result. QED.

Proof of Theorem 3.3: Let $a_J(x) = I_r \otimes a^J(x)$. Note that for any $u_m(z)$ as in
Theorem 3.1 or Theorem 3.2, and conformable constant matrix $C_J$,

$$E[\|u_m(z) - u_{C_J m_J}(z)\|^2] \leq 2E[\|a(x) - C_J a_J(x)\|^2\|\Omega(x)\|] + 2E[\|E[\{a(x) - C_J a_J(x)\} \mid w]\|^2\|Z(w)\|]$$
$$\leq C E[\|a(x) - C_J a_J(x)\|^2] + C E[E[\|a(x) - C_J a_J(x)\|^2 \mid w]] \leq C E[\|a(x) - C_J a_J(x)\|^2].$$

The  conclusion  then  follows  by  Theorem  2.1.     QED. 


For the proof of Theorems 4.1 - 4.4, for any function $b$ of the data let
$E_P b = E[b \mid P, d=1]$, and note that for functions $b(w)$ of $w$, $E_P b = E[b \mid P]$.

Proof of Theorem 4.1: As shown in Newey (1994), the correction terms for nonparametric
estimation can be derived separately for $\alpha_y$, $\alpha_x$, and $\alpha_\pi$. From the form
of the conditional expectations correction in Newey (1994) it follows that the correction
term for $\alpha_y$ and $\alpha_x$ is $-E[a(w) \mid P, d=1]\rho$. Also, let
$\pi(w,\gamma) = E_\gamma[d \mid w]$ denote the propensity score for a distribution
parameterized by $\gamma$ that passes through the truth. Then by Newey (1994) the correction
term for nonparametric estimation of $\pi(w)$ can be computed from the derivative of
$E[d\,a(w)E[\varepsilon \mid \pi(w,\gamma), d=1]]$ with respect to $\gamma$ at the truth. By
equation (17), iterated expectations, the chain rule, and the fact that
$\partial\pi(w,\gamma)/\partial\gamma = \partial E_\gamma[d \mid w]/\partial\gamma = E[\{d - P\}S_\gamma(z) \mid w]$
for the score $S_\gamma(z)$ for $z$,

$$\partial E[d\,a(w)E[\varepsilon \mid \pi(w,\gamma), d=1]]/\partial\gamma = \partial E[d\,a(w)E[\lambda(P) \mid \pi(w,\gamma), d=1]]/\partial\gamma$$
$$= \partial E[d\,a(w)E[\lambda(P - \pi(w,\gamma) + P) \mid P, d=1]]/\partial\gamma + \partial E[d\,a(w)\lambda(\pi(w,\gamma))]/\partial\gamma$$
$$= E[P\lambda_P(P)\{a(w) - E[a(w) \mid P]\}\partial\pi(w,\gamma)/\partial\gamma] = E[P\lambda_P(P)\{a(w) - E[a(w) \mid P]\}\{d - P\}S_\gamma(z)].$$

Then by Newey (1994) the correction term for estimation of $P$ is
$P\lambda_P(P)\{a(w) - E[a(w) \mid P]\}(d - P)$. Noting that
$E[a(w) \mid P, d=1] = E[d\,a(w) \mid P]/E[d \mid P] = E[a(w) \mid P]$, we obtain the first
conclusion. Also,
$H_m = E[\partial m(z,\theta_0,\alpha_0)/\partial\theta] = -E[P a(w)\{x - E[x \mid P]\}']$.


Equation (9) is then

$$-E[aP(x - E_P x)'] = -E[(a - E_P a)Px'] = E[(a - E_P a)\zeta^2(\bar a - E_P\bar a)'] = E[(a - E_P a)\eta^2(\bar a - E_P\bar a)'],$$

where $\bar a$ denotes the optimal instruments and the $w$ argument is suppressed for
convenience. Subtracting gives

$$-E[(a - E_P a)(Px - \eta^2\{\bar a - E_P\bar a\})'] = 0.$$

Since $a(w)$ can be any function of $w$, this equation implies that

$$Px - \eta^2(\bar a - E_P\bar a) = h(P)$$

for some function $h(P)$ of $P$. Dividing through by $\eta^2$ and taking conditional
expectations given $P$ gives $P E_P(\eta^{-2}x) = h(P)E_P(\eta^{-2})$. Solving,
$h(P) = P E_P(\eta^{-2}x)/E_P(\eta^{-2})$, and solving again,

$$\bar a - E_P\bar a = P\eta^{-2}[x - E_P(\eta^{-2}x)/E_P(\eta^{-2})] = P\eta^{-2}(x - E[\eta^{-2}x \mid P]/E[\eta^{-2} \mid P]).$$

Setting $\bar a$ equal to the expression on the right-hand side, and noting that
$E[\bar a \mid P] = 0$ for that choice, gives the second result. The last result follows from
equation (3.20) of Newey and Powell (1993). QED.

Proof of Theorem 4.2: Choose $A_J = A_J(w)$ and $C_J$ such that for $a_J = C_J A_J$,
$E[\|\bar a - a_J\|^2] \to 0$. Then since $E_P\bar a = 0$,

$$E[\|u_{\bar m}(z) - C_J u_J(z)\|^2] = E[\|\{(\bar a - a_J) - E_P(\bar a - a_J)\}\zeta\|^2] \leq C E[\|\bar a - a_J\|^2\eta^2] + C E[\|E_P(\bar a - a_J)\|^2\eta^2]$$
$$\leq C E[\|\bar a - a_J\|^2] + C E[E_P\|\bar a - a_J\|^2] \leq C E[\|\bar a - a_J\|^2] \to 0.$$

The conclusion then follows by Theorem 2.1. QED.


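Proof of Theorem 4.3: It follows as for the conditional mean case that the correction
term for estimation of $\alpha_q$ is $-E[a(w) \mid P, d=1]\,d\{q(\varepsilon,P) - \lambda(P)\}$,
where $\lambda(P) = E[q(\varepsilon,P) \mid P, d=1]$. Similarly, the correction term for
estimation of $P$ is
$\{a(w) - E[a(w) \mid P, d=1]\}P\{E[q_P(\varepsilon,P) \mid P, d=1] - \lambda_P(P)\}(d - P)$, where
$P$ and $\varepsilon$ subscripts will denote corresponding partial derivatives, giving the
first conclusion.

To solve equation (9) for the optimal moment functions, note
$H_m = E[\partial m(z,\theta_0,\alpha_0)/\partial\theta] = -E[a(w)\,d\{q_\varepsilon(\varepsilon,P)x - E[q_\varepsilon(\varepsilon,P)x \mid P, d=1]\}']$.
Let $f(\varepsilon \mid P)$ denote the density of $\varepsilon$ given $w$ and $d = 1$,
$s_\varepsilon = s_\varepsilon(\varepsilon,P)$, and $s_P = s_P(\varepsilon,P)$. For notational
simplicity let $q_\varepsilon = q_\varepsilon(\varepsilon,P)$ and $q_P = q_P(\varepsilon,P)$. Note
that
$E[d q_\varepsilon \mid w] = P E[d q_\varepsilon \mid w, d=1] + (1-P)E[d q_\varepsilon \mid w, d=0] = P E[d q_\varepsilon \mid w, d=1] = P E_P q_\varepsilon$.
Therefore,

$$H_m = -E[aP(E_P q_\varepsilon)(x - E_P x)'] = -E[(a - E_P a)P(E_P q_\varepsilon)(x - E_P x)'].$$

Let $\rho = d[q(\varepsilon,P) - \lambda(P)]$. Integration by parts of
$E_P q_\varepsilon = \int q_\varepsilon(\varepsilon,P)f(\varepsilon \mid P)\,d\varepsilon$,
differentiating $\lambda(P) = \int q(\varepsilon,P)f(\varepsilon \mid P)\,d\varepsilon$ with respect
to $P$, and using $E_P s_\varepsilon = E_P s_P = 0$, we obtain

$$E_P q_\varepsilon = \int q_\varepsilon(\varepsilon,P)f(\varepsilon \mid P)\,d\varepsilon = -E_P q(\varepsilon,P)s_\varepsilon = -E_P(\rho s_\varepsilon), \qquad E_P q_P - \lambda_P(P) = -E_P(\rho s_P).$$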
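The two identities can be spelled out as follows. This worked step is a sketch that assumes, beyond what is stated in the text, that $q(\varepsilon,P)f(\varepsilon \mid P)$ vanishes at the boundary of the support of $\varepsilon$ and that differentiation under the integral sign is permitted. It also uses the fact that $\rho$ equals $q(\varepsilon,P) - \lambda(P)$ when $d = 1$.

```latex
\begin{aligned}
E_P q_\varepsilon
  &= \int q_\varepsilon(\varepsilon,P) f(\varepsilon\mid P)\,d\varepsilon
   = -\int q(\varepsilon,P)\,\frac{\partial f(\varepsilon\mid P)}{\partial\varepsilon}\,d\varepsilon
   = -E_P[q\,s_\varepsilon] \\
  &= -E_P[\{q-\lambda(P)\}s_\varepsilon] = -E_P(\rho s_\varepsilon)
   \qquad\text{since } E_P s_\varepsilon = 0, \\[4pt]
\lambda_P(P)
  &= \frac{\partial}{\partial P}\int q(\varepsilon,P) f(\varepsilon\mid P)\,d\varepsilon
   = E_P q_P + E_P[q\,s_P], \\
E_P q_P - \lambda_P(P)
  &= -E_P[q\,s_P] = -E_P[\{q-\lambda(P)\}s_P] = -E_P(\rho s_P)
   \qquad\text{since } E_P s_P = 0.
\end{aligned}
```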
It follows that

$$\zeta = \rho - E_P(\rho s_P)P(d - P).$$

Let $\bar a$ and $\bar q$ denote the optimal functions,
$\bar\rho = d\{\bar q(\varepsilon,P) - E_P\bar q(\varepsilon,P)\}$, and
$\bar\zeta = \bar\rho - E_P(\bar\rho s_P)P(d - P)$. Note that by conditional independence,
$E[\rho\bar\rho \mid w] = P E_P(\rho\bar\rho)$. Then equation (9) is

$$E[(a - E_P a)P E_P(\rho s_\varepsilon)(x - E_P x)'] = H_m = E[u_m(z)u_{\bar m}(z)']$$
$$= E[(a - E_P a)E[\zeta\bar\zeta \mid w](\bar a - E_P\bar a)'] = E[P(a - E_P a)\{E_P(\rho\bar\rho) + P^2(1-P)E_P(\rho s_P)E_P(\bar\rho s_P)\}(\bar a - E_P\bar a)'].$$

As in the proof of Theorem 4.1, equality for all $a(w)$ requires that

$$E_P(\rho s_\varepsilon)(x - E_P x)' - \{E_P(\rho\bar\rho) + P^2(1-P)E_P(\rho s_P)E_P(\bar\rho s_P)\}(\bar a - E_P\bar a)' = h(P).$$

Taking conditional expectations of both sides given $P$ we find that $h(P) = 0$. This
equation will hold if $\bar a = x$, and for $g(P) = E_P(\bar\rho s_P)$,

$$0 = E_P(\rho s_\varepsilon) - [E_P(\rho\bar\rho) + P^2(1-P)E_P(\rho s_P)g(P)]$$
$$= -E[d\,q(\varepsilon,P)\{\bar\rho - [s_\varepsilon - P^2(1-P)s_P g(P)]\} \mid P, d=1].$$


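For this equality to hold for any function $q(\varepsilon,P)$ it is sufficient that
$\bar q = [s_\varepsilon - P^2(1-P)s_P g(P)]$, since $E_P\bar q = 0$. Multiplying through by
$s_P$ and taking a conditional expectation gives
$g(P) = E_P(s_\varepsilon s_P) - P^2(1-P)E_P(s_P^2)g(P)$. Solving,
$g(P) = E_P(s_\varepsilon s_P)/[1 + P^2(1-P)E_P(s_P^2)]$. An optimal $\bar q$ is then given by

$$\bar q(\varepsilon,P) = s_\varepsilon - g(P)s_P, \qquad g(P) = P E_P(s_\varepsilon s_P)/[P^{-1}(1-P)^{-1} + P E_P(s_P^2)],$$

where $g(P)$ is now redefined to absorb the factor $P^2(1-P)$.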
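The step from the linear equation to the final coefficient on $s_P$ can be written out as below. The symbol $\tilde g(P)$ is introduced here only to keep the intermediate quantity $E_P(\bar\rho s_P)$ distinct from the final $g(P)$; it does not appear in the original.

```latex
\begin{aligned}
\tilde g(P) &= E_P(\bar\rho s_P), \qquad
\bar q = s_\varepsilon - P^2(1-P)\,s_P\,\tilde g(P), \\
\tilde g(P) &= E_P(s_\varepsilon s_P) - P^2(1-P)E_P(s_P^2)\,\tilde g(P)
 \;\Longrightarrow\;
 \tilde g(P) = \frac{E_P(s_\varepsilon s_P)}{1 + P^2(1-P)E_P(s_P^2)}, \\
g(P) &= P^2(1-P)\,\tilde g(P)
      = \frac{P^2(1-P)E_P(s_\varepsilon s_P)}{1 + P^2(1-P)E_P(s_P^2)}
      = \frac{P\,E_P(s_\varepsilon s_P)}{P^{-1}(1-P)^{-1} + P\,E_P(s_P^2)}.
\end{aligned}
```

The last equality divides numerator and denominator by $P(1-P)$, and it implies $E_P(\bar\rho s_P) = \tilde g(P) = g(P)P^{-2}(1-P)^{-1}$.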


To show the last conclusion, note that $E_P\bar q = 0$, so that $\bar\rho = d\bar q$. Also,
$E_P(\bar\rho s_P) = g(P)P^{-2}(1-P)^{-1}$, so that

$$\bar\zeta = d\bar q - E_P(\bar\rho s_P)P(d - P) = d\bar q - g(P)P^{-1}(1-P)^{-1}(d - P).$$

Therefore, $u_{\bar m}(z) = (x - E_P x)\bar\zeta$ matches the efficient score given in equation
(4.15) of Newey and Powell (1993), giving the last conclusion. QED.

Proof  of  Theorem  4.4:      By  arguments  like  those  above,     u  (z)  =  (x  -  E[x|P]){p     - 

•J  J 

P 

E[p  s    |P](d  -  P)>     for     p     =  d(p  -E[p    |P,d=l]).      Suppose  that  there  is     C       such  that  as 

J  — >  oo,      EldwJxXptcPj-C^p^cP)}2]  =  o(l)     for     w(x)  =   llx-E[x  |  P]ll2  + 
P(l-P)E[(sP)2|P]E[llx-E[x|P]ll2|P].     Then  for     C     =  C  ®I       and     e     =  p(e,P)-C'TpT(e,P), 
where     q     is  the  dimension  of     x,     E[llu— (z)-C'u  (z)ll    ]  =  E[llx-E[x|P]ll2<p-C'p  }2]  + 
2E[P(l-P)llx-E[x|P]ll2<E[<p-C^Pj(c,P)}sP|P]}2]   <  E[dw(x){eJ-E[eJ|P,d=l]>2]  -  2E[dw(x)e2]   + 
2E[dw(x)<E[ej|x,d=l]}2]  < 
4E[dw(x)s2]  =  o(l). 